evidence document
- North America > United States (0.68)
- Europe > United Kingdom > Wales (0.04)
- Europe > Romania > Sud-Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.67)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada (0.04)
Citation Failure: Definition, Analysis and Efficient Mitigation
Buchmann, Jan, Gurevych, Iryna
Citations from LLM-based RAG systems are supposed to simplify response verification. However, this does not hold under citation failure, where a model generates a helpful response but fails to cite complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) we study when citation failure occurs and (2) how it can be mitigated. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to analyze failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To improve LLM citation efficiently, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings. We make our data and code publicly available.
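The abstract only names the three citation signals CITENTION integrates. As a rough illustration of what combining them could look like, the minimal sketch below fuses per-segment scores from a generative, an attention-based, and a retrieval-based scorer with a weighted sum; the function names, normalization, and weights are illustrative assumptions, not CITENTION's actual design.

```python
def fuse_citation_scores(generative, attention, retrieval,
                         weights=(1.0, 1.0, 1.0), top_k=2):
    """Fuse per-segment citation scores from three signals into one ranking.

    `generative`, `attention`, and `retrieval` are lists of floats, one score
    per candidate evidence segment. Each list is min-max normalized so the
    signals are comparable, then combined with a weighted sum. This fusion
    rule is an illustrative assumption, not the paper's actual method.
    """
    def normalize(scores):
        lo, hi = min(scores), max(scores)
        if hi == lo:
            return [0.0] * len(scores)
        return [(s - lo) / (hi - lo) for s in scores]

    g, a, r = normalize(generative), normalize(attention), normalize(retrieval)
    fused = [weights[0] * gi + weights[1] * ai + weights[2] * ri
             for gi, ai, ri in zip(g, a, r)]
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return ranked[:top_k]  # indices of segments to cite

# Example: three candidate segments, each scored by the three signals.
print(fuse_citation_scores(
    generative=[0.1, 0.7, 0.2],   # e.g. likelihood the model cites segment i
    attention=[0.3, 0.5, 0.1],    # e.g. aggregated attention mass on segment i
    retrieval=[0.2, 0.6, 0.9],    # e.g. response-to-segment similarity
))  # -> [1, 2]
```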
- Europe > Austria > Vienna (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (10 more...)
- North America > United States (0.68)
- Europe > United Kingdom > Wales (0.04)
- Europe > Romania > Sud-Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG
Ki, Dayeon, Carpuat, Marine, McNamee, Paul, Khashabi, Daniel, Yang, Eugene, Lawrie, Dawn, Duh, Kevin
Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. While such systems have been proposed, an open question is whether the mixture of different document languages impacts generation and citation in unintended ways. To investigate, we introduce a controlled methodology using model internals to measure language preference while holding other factors such as document relevance constant. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English, with this bias amplified for lower-resource languages and for documents positioned mid-context. Crucially, we find that models sometimes trade off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone. Our findings shed light on how language models leverage multilingual context and how this influences citation behavior.

Retrieval-Augmented Generation (RAG) systems have become a core component of modern large language model (LLM) pipelines, enabling models to answer knowledge-intensive queries by supplementing their limited parametric knowledge with external information (Lewis et al., 2020; Karpukhin et al., 2020; Gao et al., 2024). Given that over 50% of digital content is produced in languages other than English (Statista, 2025), recent work has extended these systems to multilingual RAG (mRAG) settings, which handle queries and documents in languages beyond English (Chirkova et al., 2024; Wu et al., 2024). Despite recent advances, prior work highlights a key challenge in mRAG systems: language preference, a systematic tendency of models to favor sources written in certain languages during generation (Park & Lee, 2025). Understanding this behavior is crucial, as citation patterns shape both the information users see and the languages prioritized in multilingual knowledge access. Existing approaches to measuring language preference, however, often fail to capture citation correctness. In short-form mRAG, preference has been estimated via information overlap (Sharma et al., 2025) or embedding similarity (Park & Lee, 2025), which do not directly account for correctness. In long-form mRAG, where outputs contain in-line citations (Zheng et al., 2025; Xu & Peng, 2025), preference has typically been measured by comparing citation frequencies against the language distribution of retrieved documents.
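As a concrete illustration of the frequency-based measure mentioned at the end of the passage above (citation frequency per language compared against the language distribution of the retrieved pool), here is a minimal sketch; the function name and the ratio-style metric are assumptions for illustration, not the exact measure used in the cited work.

```python
from collections import Counter

def language_preference_ratio(cited_doc_langs, retrieved_doc_langs):
    """Compare how often each language is cited against how often it appears
    in the retrieved pool. A ratio > 1 means the language is cited more than
    its share of the pool would suggest; < 1 means it is under-cited.
    """
    cited = Counter(cited_doc_langs)
    pool = Counter(retrieved_doc_langs)
    n_cited = sum(cited.values()) or 1
    n_pool = sum(pool.values()) or 1
    ratios = {}
    for lang, pool_count in pool.items():
        cited_share = cited.get(lang, 0) / n_cited
        pool_share = pool_count / n_pool
        ratios[lang] = cited_share / pool_share
    return ratios

# Example: an English query answered over a mixed-language document pool.
print(language_preference_ratio(
    cited_doc_langs=["en", "en", "en", "de"],
    retrieved_doc_langs=["en", "en", "de", "de", "sw", "sw"],
))
# {'en': 2.25, 'de': 0.75, 'sw': 0.0} -> English cited above its pool share
```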
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.67)
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
Chen, Zijian, Ma, Xueguang, Zhuang, Shengyao, Nie, Ping, Zou, Kai, Liu, Andrew, Green, Joshua, Patel, Kshama, Meng, Ruoxi, Su, Mingyi, Sharifymoghaddam, Sahel, Li, Yanxi, Hong, Haoran, Shi, Xinyu, Liu, Xuye, Thakur, Nandan, Zhang, Crystina, Gao, Luyu, Chen, Wenhu, Lin, Jimmy
Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp rely on black-box live web search APIs and have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.
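To make the fixed-corpus setup concrete, the sketch below wires a BM25 search tool over a small static document list using the rank_bm25 package (pip install rank-bm25), the kind of reproducible retrieval interface a fixed-corpus benchmark like BrowseComp-Plus enables; the corpus, function name, and tool interface are illustrative, not the benchmark's actual harness.

```python
from rank_bm25 import BM25Okapi

# A fixed, fully inspectable corpus stands in for a live web search API,
# so retrieval results are reproducible across runs and systems.
corpus = [
    "BrowseComp-Plus pairs each query with human-verified supporting documents.",
    "BM25 is a sparse lexical retrieval baseline over a tokenized corpus.",
    "Dense retrievers encode queries and documents into embedding vectors.",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def search(query: str, k: int = 2) -> list[str]:
    """Return the top-k corpus documents for a query (the agent's search tool)."""
    return bm25.get_top_n(query.lower().split(), corpus, n=k)

print(search("lexical retrieval baseline"))
```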
- Europe > Austria > Vienna (0.14)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- Oceania > Australia > Queensland (0.04)
- (6 more...)